Multilabel classification with random forest
RandomForestClassifier
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
make_multilabel_classification
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_multilabel_classification.html#sklearn.datasets.make_multilabel_classification
code:python
>> from sklearn.datasets import make_multilabel_classification
>> from sklearn.ensemble import RandomForestClassifier
>> from sklearn.model_selection import train_test_split
>> X, Y = make_multilabel_classification(n_samples=12, n_classes=3, random_state=0)
>> X.shape, Y.shape
((12, 20), (12, 3)) # X is (n_samples, n_features); Y is (n_samples, n_outputs)
>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
>> X_train.shape, X_test.shape
((9, 20), (3, 20))
>> clf = RandomForestClassifier(max_depth=2, random_state=0)
>> clf.fit(X_train, Y_train)
RandomForestClassifier(max_depth=2, random_state=0)
>> clf.predict(X_test) # return value has shape (n_samples, n_outputs)
array([[1, 1, 0],
       [1, 1, 0],
       [0, 1, 0]])
>> Y_test
array([[0, 0, 0],
       [0, 0, 0],
       [0, 1, 0]])
>> clf.score(X_test, Y_test) # subset accuracy: exactly 1 of the 3 samples matches on all labels (the third row)
0.3333333333333333
>> # A list of length 3 (n_outputs=3), ordered as in classes_
>> # Each element is an array (one per output) of shape (3, 2) = (n_samples, [nega, posi]); nega + posi = 1 per row
>> # predict thresholds at 0.5 (0 if the nega score > posi, 1 otherwise)
>> clf.predict_proba(X_test)
[array([[0.386     , 0.614     ],   # output 1: 1, 1, 0 (reading predict **column-wise**)
        [0.49766667, 0.50233333],
        [0.5455    , 0.4545    ]]),
 array([[0.28416667, 0.71583333],   # output 2: 1, 1, 1
        [0.37683333, 0.62316667],
        [0.1515    , 0.8485    ]]),
 array([[0.538     , 0.462     ],   # output 3: 0, 0, 0
        [0.56183333, 0.43816667],
        [0.68916667, 0.31083333]])]
>> Y_train
array([[0, 1, 0],
       [1, 1, 1],
       [0, 0, 1],
       [0, 1, 0],
       [1, 0, 0],
       [0, 0, 0],
       [1, 1, 1],
       [0, 1, 0],
       [1, 1, 0]])
>> clf.classes_
[array([0, 1]), array([0, 1]), array([0, 1])]
>> clf.n_outputs_
3
>> clf.n_features_
20
>> clf.feature_importances_
array([0.09609789, 0.04004824, 0.08110294, 0.07757968, 0.0658578 ,
0.03856713, 0.02071324, 0.0548202 , 0.04409759, 0.02824918,
0.05900911, 0.00996354, 0.02613438, 0.07684385, 0.01227006,
0.01527436, 0.09345899, 0.07768832, 0.03305281, 0.04917069])
>> len(clf.estimators_)
100
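The 0.333… score above can be sanity-checked: for a 2-D multilabel `Y`, `score` delegates to `accuracy_score`, which counts a sample as correct only when every one of its labels matches (subset / exact-match accuracy). A minimal check on the same toy setup:

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, Y = make_multilabel_classification(n_samples=12, n_classes=3, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
clf = RandomForestClassifier(max_depth=2, random_state=0).fit(X_train, Y_train)
Y_pred = clf.predict(X_test)

# score() delegates to accuracy_score; for multilabel Y this is subset accuracy:
# a row counts as correct only when ALL of its labels match
exact_match = np.all(Y_pred == Y_test, axis=1).mean()
assert np.isclose(clf.score(X_test, Y_test), exact_match)
```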
Handling predict_proba in the multilabel case
Motivation: we want to move the decision threshold away from 0.5
See the code below for how
Based on the implementation of predict (v0.23.2)
predict returns classes (integers), so a dtype is specified there
clf.classes_[k]: classes_ is a list whose elements are numpy arrays
take is used to pick the class at the index of the larger score
Example: [0.386, 0.614]
The index of the larger value is 1
clf.classes_[0] is array([0, 1]), so take retrieves the class with the larger score
clf.classes_[0].take([1]) -> array([1])
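The argmax-plus-take decoding described above can be sketched end to end (this mirrors the logic of the v0.23.2 predict implementation rather than quoting the library source verbatim):

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, Y = make_multilabel_classification(n_samples=12, n_classes=3, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
clf = RandomForestClassifier(max_depth=2, random_state=0).fit(X_train, Y_train)

proba = clf.predict_proba(X_test)  # list of n_outputs_ arrays, each (n_samples, 2)
n_samples = proba[0].shape[0]
pred = np.empty((n_samples, clf.n_outputs_), dtype=np.int64)
for k in range(clf.n_outputs_):
    # argmax gives the index of the larger score per row;
    # take maps that index back through classes_[k] to the class label
    pred[:, k] = clf.classes_[k].take(np.argmax(proba[k], axis=1), axis=0)

assert np.array_equal(pred, clf.predict(X_test))
```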
code:python
>> import numpy as np
>> proba = clf.predict_proba(X_test) # list of length n_outputs_; each element is (n_samples, n_classes)
>> n_samples = proba[0].shape[0]
>> positive_scores = np.empty((n_samples, clf.n_outputs_))
>> for k in range(clf.n_outputs_):
...     positive_scores[:, k] = proba[k][:, 1]
>> positive_scores
array([[0.614     , 0.71583333, 0.462     ],
       [0.50233333, 0.62316667, 0.43816667],
       [0.4545    , 0.8485    , 0.31083333]])
>> (positive_scores >= 0.5).astype(int) # equivalent to predict
array([[1, 1, 0],
       [1, 1, 0],
       [0, 1, 0]])
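With the positive scores stacked into a (n_samples, n_outputs) matrix, the original motivation, moving the threshold away from 0.5, is a single comparison. A sketch using `np.column_stack` instead of the explicit loop (note that a strict `> 0.5` matches predict exactly, since argmax breaks a 0.5/0.5 tie toward class 0):

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, Y = make_multilabel_classification(n_samples=12, n_classes=3, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
clf = RandomForestClassifier(max_depth=2, random_state=0).fit(X_train, Y_train)

# positive-class column of every output, stacked to (n_samples, n_outputs)
positive_scores = np.column_stack(
    [proba_k[:, 1] for proba_k in clf.predict_proba(X_test)]
)

default_pred = (positive_scores > 0.5).astype(int)   # reproduces predict()
stricter_pred = (positive_scores > 0.7).astype(int)  # fewer positive labels
assert np.array_equal(default_pred, clf.predict(X_test))
```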
Multilabel confusion matrix (MCM)
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.multilabel_confusion_matrix.html#sklearn.metrics.multilabel_confusion_matrix
code:python
>> from sklearn.metrics import multilabel_confusion_matrix
>> multilabel_confusion_matrix(Y_test, clf.predict(X_test)) # class-wise
array([[[1, 2],   # [[tn, fp],   class 1: tn=1, fp=2
        [0, 0]],  #  [fn, tp]]
       [[0, 2],   # class 2: tp=1, fp=2
        [0, 1]],
       [[3, 0],   # class 3: tn=3
        [0, 0]]])
>> multilabel_confusion_matrix(Y_test, clf.predict(X_test), samplewise=True)
array([[[1, 2],   # sample 1: tn=1, fp=2
        [0, 0]],
       [[1, 2],   # sample 2: also tn=1, fp=2
        [0, 0]],
       [[2, 0],   # sample 3: tn=2, tp=1
        [0, 1]]])
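Each 2x2 block of the MCM exposes tn/fp/fn/tp directly, so per-class precision and recall can be read off without extra bookkeeping. A sketch, cross-checked against precision_score/recall_score (the `zero_division=0` parameter exists since scikit-learn 0.22):

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (multilabel_confusion_matrix,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

X, Y = make_multilabel_classification(n_samples=12, n_classes=3, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
clf = RandomForestClassifier(max_depth=2, random_state=0).fit(X_train, Y_train)
Y_pred = clf.predict(X_test)

mcm = multilabel_confusion_matrix(Y_test, Y_pred)  # shape (n_classes, 2, 2)
tn, fp = mcm[:, 0, 0], mcm[:, 0, 1]
fn, tp = mcm[:, 1, 0], mcm[:, 1, 1]

# per-class precision/recall, with 0/0 treated as 0
with np.errstate(invalid="ignore", divide="ignore"):
    precision = np.where(tp + fp > 0, tp / (tp + fp), 0.0)
    recall = np.where(tp + fn > 0, tp / (tp + fn), 0.0)
```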